From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings

Authors

  • Johannes Bjerva
  • Isabelle Augenstein
Abstract

A core part of linguistic typology is the classification of languages according to linguistic properties, such as those detailed in the World Atlas of Language Structures (WALS). Doing this manually is prohibitively time-consuming, which is in part evidenced by the fact that only 100 out of over 7,000 languages spoken in the world are fully covered in WALS. We learn distributed language representations, which can be used to predict typological properties on a massively multilingual scale. Additionally, quantitative and qualitative analyses of these language embeddings can tell us how language similarities are encoded in NLP models for tasks at different typological levels. The representations are learned in an unsupervised manner alongside tasks at three typological levels: phonology (grapheme-to-phoneme prediction and phoneme reconstruction), morphology (morphological inflection), and syntax (part-of-speech tagging). We consider more than 800 languages and find significant differences in the language representations encoded, depending on the target task. For instance, although Norwegian Bokmål and Danish are typologically close to one another, they are phonologically distant, which is reflected in their language embeddings growing relatively distant in a phonological task. We are also able to predict typological features in WALS with high accuracy, even for unseen language families.
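As a rough illustration of how a typological feature might be predicted from language embeddings, the following minimal sketch uses cosine similarity and nearest-neighbour voting. The abstract does not say which classifier the authors use, so the k-nearest-neighbour approach is an assumption, and the language names, vectors, feature values, and the helper names `cosine` and `predict_feature` are all hypothetical placeholders rather than data or code from the paper.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_feature(query, embeddings, labels, k=3):
    """Predict a feature value by majority vote among the k labelled
    languages whose embeddings are most similar to the query.
    (Hypothetical helper; k-NN is an assumption, not the paper's method.)"""
    ranked = sorted(
        labels,
        key=lambda lang: cosine(query, embeddings[lang]),
        reverse=True,
    )
    votes = Counter(labels[lang] for lang in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 3-dimensional embeddings; all values are made up for illustration.
toy_embeddings = {
    "lang_a": [0.9, 0.1, 0.0],
    "lang_b": [0.8, 0.2, 0.1],
    "lang_c": [0.1, 0.9, 0.2],
    "lang_d": [0.0, 0.8, 0.3],
    "lang_e": [0.85, 0.15, 0.05],  # language with an unknown feature value
}

# Placeholder feature values standing in for the values of one WALS feature.
toy_labels = {
    "lang_a": "value_1",
    "lang_b": "value_1",
    "lang_c": "value_2",
    "lang_d": "value_2",
}

prediction = predict_feature(toy_embeddings["lang_e"], toy_embeddings, toy_labels, k=3)
print(prediction)
```

In this toy setup the unlabelled language sits close to `lang_a` and `lang_b`, so the vote returns their shared value. With real embeddings, the labelled set would be the languages for which the feature is documented.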

Similar resources

What is Phonological Typology?

In this talk I am concerned with the following questions: 1. What is phonological typology? 2. How are phonological typology and phonetic typology the same/different? 3. How are phonological typology and general phonology the same/different? 4. How are phonological typology and general typology the same/different? Despite earlier work by Trubetzkoy, Jakobson, Martinet, Greenberg and others, and...

Code-Copying in the Balochi Language of Sistan

This empirical study deals with language contact phenomena in Sistan. Code-copying is viewed as a strategy of linguistic behavior when a dominated language acquires new elements in lexicon, phonology, morphology, syntax, pragmatic organization, etc., which can be interpreted as copies of a dominating language. In this framework Persian is regarded as the model code which provides elements for b...

Linguistic Typology and Formal Grammar

The goal of this chapter is to provide an overview of the relationship between linguistic typology and formal grammar—a relationship that has existed for several decades now and is unlikely to disappear any time soon. As the reader will see, the two orientations differ in a number of respects, but they share the custody of language, and that motivates the need for communication between the two....

Experiments in Unsupervised Learning of Natural Language

Linguistics has invented and discarded many theories of language, and there are currently many competitors to the basic idea of phrase structure grammars as capturing the syntactic structure of language. Computational Linguistics has proven to be a testing ground for theories and grammars, and is similarly diverse. Moreover, we have recently learnt that similar principles and techniques may ...

Cross-linguistic Influence at Syntax-pragmatics Interface: A Case of OPC in Persian

Recent research in the area of Second Language Acquisition has proposed that bilinguals and L2 learners show syntactic indeterminacy when syntactic properties interface with other cognitive domains. Most of the research in this area has focused on the pragmatic use of syntactic properties while the investigation of compliance with a grammatical rule at syntax-related interfaces has not received...

Journal title:
  • CoRR

Volume abs/1802.09375

Publication date 2018